OCPBUGS-84308: fix(cpo) delete terminated MCD pods to retry in-place upgrades#8434
Skipping CI for Draft Pull Request. |
|
Walkthrough
reconcileInPlaceUpgrade now returns errors from reconcileUpgradePods as “failed to reconcile upgrade pods”. reconcileUpgradePods was extended to detect upgrade Machine Config Daemon pods in Succeeded or Failed phases when the corresponding node still needs an upgrade; it logs the detection, deletes the terminated pod (ignoring NotFound), and relies on subsequent reconciles to recreate the pod. Existing behavior for Running pods, creating missing pods, and deleting idle pods for fully updated nodes is covered by a new TestReconcileUpgradePods unit test.
Sequence Diagram(s)
sequenceDiagram
participant Controller as Controller
participant API_Server as API Server
participant Node as Node
participant Pod as Upgrade Pod
Controller->>API_Server: Get upgrade Pod for node
API_Server-->>Controller: Return Pod (Running | Succeeded | Failed | NotFound)
alt Pod is Running
Controller->>Controller: Leave Pod unchanged
else Pod is Succeeded or Failed and Node needs upgrade
Controller->>API_Server: Log detection and Delete Pod
API_Server-->>Controller: Delete response (Success / NotFound / Error)
Controller->>Controller: Retry path will recreate pod later
else Pod NotFound
Controller->>API_Server: Create upgrade Pod
API_Server-->>Controller: Create response (Success / Error)
end
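The branches in the diagram can be read as a small decision function. The following is a minimal sketch, not the controller's code: `PodPhase` is redefined locally to mirror the string values of `corev1.PodPhase` so the example compiles without Kubernetes dependencies, and `Action` and `decide` are hypothetical names. It also assumes a missing pod is created only when the node still needs an upgrade, which the diagram does not state explicitly.

```go
package main

import "fmt"

// PodPhase mirrors the string values of corev1.PodPhase. It is redefined
// here only so the sketch builds without Kubernetes imports.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// Action is a hypothetical label for what the reconciler does for one node.
type Action string

const (
	ActionNone   Action = "none"
	ActionCreate Action = "create"
	ActionDelete Action = "delete"
)

// decide maps the observed pod state and the node's upgrade need to an action.
func decide(podExists bool, phase PodPhase, nodeNeedsUpgrade bool) Action {
	switch {
	case !podExists && nodeNeedsUpgrade:
		// Assumption: a missing pod is only created for nodes that need work.
		return ActionCreate
	case !podExists:
		return ActionNone
	case !nodeNeedsUpgrade:
		// Idle pod on a fully updated node.
		return ActionDelete
	case phase == PodSucceeded || phase == PodFailed:
		// Terminated pod on a node that still needs the upgrade:
		// delete it so a later reconcile can recreate it.
		return ActionDelete
	default:
		// Pending or Running: leave the pod unchanged.
		return ActionNone
	}
}

func main() {
	fmt.Println(decide(true, PodRunning, true))
	fmt.Println(decide(true, PodFailed, true))
	fmt.Println(decide(false, "", true))
}
```

Keeping the decision separate from the API calls makes each branch easy to cover in a table-driven test.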
🚥 Pre-merge checks | ✅ 11 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (11 passed) |
|
@PoornimaSingour: This pull request references Jira Issue OCPBUGS-84308, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #8434 +/- ##
==========================================
+ Coverage 40.69% 41.32% +0.63%
==========================================
Files 755 755
Lines 93373 93446 +73
==========================================
+ Hits 37994 38618 +624
+ Misses 52646 52081 -565
- Partials 2733 2747 +14
... and 35 files with indirect coverage changes
|
🧹 Nitpick comments (1)
control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go (1)
736-738: ⚡ Quick win
Tighten deleted-pod assertion to NotFound instead of any error.
HaveOccurred() can pass for unrelated failures. Asserting IsNotFound makes the test intent explicit and failures clearer.
Proposed test hardening
+import apierrors "k8s.io/apimachinery/pkg/api/errors"
 ...
 if tc.expectPodDeleted {
 	g.Expect(getErr).To(HaveOccurred(), "expected pod to be deleted")
+	g.Expect(apierrors.IsNotFound(getErr)).To(BeTrue(), "expected pod get to return NotFound after deletion")
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go` around lines 736 - 738, Replace the loose assertion g.Expect(getErr).To(HaveOccurred()) for deleted pods with a NotFound-specific check: import k8s.io/apimachinery/pkg/api/errors as apierrors (or errors alias used elsewhere) and replace the assertion with g.Expect(apierrors.IsNotFound(getErr)).To(BeTrue(), "expected pod to be NotFound") when tc.expectPodDeleted is true, referencing the tc.expectPodDeleted branch and the getErr variable so the test fails only for a NotFound error.
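To illustrate why the broad check is weaker, here is a standard-library-only sketch. `errNotFound`, `errTimeout`, and `isNotFound` are hypothetical stand-ins; in the real test the check would be `apierrors.IsNotFound(getErr)`, which inspects the API status reason rather than a sentinel value.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for API errors.
var (
	errNotFound = errors.New("not found")
	errTimeout  = errors.New("client timeout")
)

// isNotFound is a stand-in for apierrors.IsNotFound in this sketch.
func isNotFound(err error) bool {
	return errors.Is(err, errNotFound)
}

func main() {
	for _, err := range []error{
		fmt.Errorf("get pod: %w", errNotFound),
		fmt.Errorf("get pod: %w", errTimeout),
	} {
		// "any error" is true for both; only the first is a real NotFound.
		fmt.Printf("%v: anyError=%t notFound=%t\n", err, err != nil, isNotFound(err))
	}
}
```

Both errors satisfy a "has an error occurred" assertion, so a test that only checks for a non-nil error would also pass when the Get failed for an unrelated reason.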
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go`:
- Around line 762-764: The test currently checks for a deleted pod using a broad
error assertion (g.Expect(getErr).To(HaveOccurred()) when tc.expectPodDeleted is
true); change this to assert specifically that the error is a NotFound error by
using the Kubernetes API errors helper on the getErr variable (e.g., assert
apierrors.IsNotFound(getErr) via the testing framework) so the deleted-pod
branch only passes for NotFound and not for other client/read errors; locate the
check guarded by tc.expectPodDeleted in inplaceupgrader_test.go and replace the
HaveOccurred() assertion with a specific IsNotFound assertion referencing
getErr.
b5637a4 to df176c0
|
/jira refresh |
|
@PoornimaSingour: This pull request references Jira Issue OCPBUGS-84308, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (dhuynh@redhat.com), skipping review request. |
|
@coderabbitai full review |
✅ Actions performed: Full review triggered. |
|
Actionable comments posted: 2
🧹 Nitpick comments (1)
control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go (1)
352-363: ⚡ Quick win
Deleted-pod retry has no requeue guarantee: upgrade may stall.
After the terminated pod is deleted, reconcileUpgradePods returns nil, reconcileInPlaceUpgrade returns nil, and Reconcile returns ctrl.Result{} (no requeue). Because the deletion doesn't mutate any node annotation, no node-watch event fires to trigger a follow-up reconciliation. If no other MachineSet event arrives, the replacement pod is never created and the upgrade stalls indefinitely, which is exactly the problem this PR is fixing.
Consider either propagating a boolean "needs requeue" flag back up through reconcileInPlaceUpgrade to Reconcile, or returning ctrl.Result{RequeueAfter: ...} whenever at least one pod was deleted:
💡 Sketch of the fix
-func (r *Reconciler) reconcileUpgradePods(...) error {
+func (r *Reconciler) reconcileUpgradePods(...) (bool, error) {
 	...
+	podDeleted := false
 	...
 	} else if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
 		...
 		if err := hostedClusterClient.Delete(ctx, pod); err != nil {
 			...
-			return fmt.Errorf("error deleting terminated upgrade MCD pod for node %s: %w", node.Name, err)
+			return false, fmt.Errorf("error deleting terminated upgrade MCD pod for node %s: %w", node.Name, err)
 		}
+		podDeleted = true
 	}
 	...
-	return nil
+	return podDeleted, nil
 }
And in reconcileInPlaceUpgrade / Reconcile, propagate the flag to return ctrl.Result{RequeueAfter: 5 * time.Second}.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go` around lines 352 - 363, reconcileUpgradePods currently deletes terminated upgrade pods but returns nil which causes reconcileInPlaceUpgrade and Reconcile to not requeue and the replacement pod may never be created; change reconcileUpgradePods to return a (bool, error) or similar indicator (e.g., deletedPod bool) when it deletes at least one pod, update reconcileInPlaceUpgrade to propagate that flag up, and have Reconcile return ctrl.Result{RequeueAfter: 5 * time.Second} (or another short duration) whenever the flag indicates a pod was deleted so the controller will immediately requeue and create the replacement pod.
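The flag propagation described in this comment could look roughly like the sketch below. It is self-contained and therefore simplified: `Result` stands in for controller-runtime's `ctrl.Result`, the pod state is a plain map from node name to phase, and the function signatures are hypothetical rather than those of the real controller. The 5-second delay is the value suggested in the comment.

```go
package main

import (
	"fmt"
	"time"
)

// Result stands in for ctrl.Result, reduced to the field used here.
type Result struct {
	RequeueAfter time.Duration
}

// requeueAfterDelete is the delay suggested in the review comment.
const requeueAfterDelete = 5 * time.Second

// reconcileUpgradePods (simplified, hypothetical signature) deletes
// terminated pods and reports whether it deleted at least one.
func reconcileUpgradePods(phaseByNode map[string]string) (bool, error) {
	podDeleted := false
	for node, phase := range phaseByNode {
		if phase == "Succeeded" || phase == "Failed" {
			delete(phaseByNode, node) // stands in for the client Delete call
			podDeleted = true
		}
	}
	return podDeleted, nil
}

// reconcile turns the flag into a delayed requeue so the replacement pod
// is created even if no other watch event arrives.
func reconcile(phaseByNode map[string]string) (Result, error) {
	podDeleted, err := reconcileUpgradePods(phaseByNode)
	if err != nil {
		return Result{}, fmt.Errorf("failed to reconcile upgrade pods: %w", err)
	}
	if podDeleted {
		return Result{RequeueAfter: requeueAfterDelete}, nil
	}
	return Result{}, nil
}

func main() {
	res, err := reconcile(map[string]string{"node-a": "Failed", "node-b": "Running"})
	fmt.Println(res.RequeueAfter, err)
}
```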
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go`:
- Around line 692-715: Update the test case that sets existingPod with a
DeletionTimestamp and Finalizers so it actually verifies the "skip" behavior
instead of just checking getErr; in the assertion block that currently checks
getErr (references variables existingPod, expectPodSkipped and the retrieved pod
variable), either assert that the retrieved pod's DeletionTimestamp is non-nil
(e.g., pod.DeletionTimestamp != nil) to prove we hit the skip path, or
replace/add a fake-client interceptor (WithInterceptorFuncs) to spy on Delete
and assert Delete was never called for that pod — do not rely solely on getErr.
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go`:
- Around line 352-363: reconcileUpgradePods now deletes both idle and terminated
pods but the error wrap at the caller still says "failed to delete idle upgrade
pods", which is misleading; update the error wrapping at the call site that
wraps the error from hostedClusterClient.Delete (the delete call inside
reconcileUpgradePods) to use a neutral message like "failed to delete upgrade
pod for node %s" or include the pod phase/node context so failures deleting
terminated pods are accurately described; adjust the fmt.Errorf wrapper (the
existing "failed to delete idle upgrade pods" message) to reference the upgrade
pod deletion generically (or include pod.Status.Phase) so logs reflect the
actual deletion target.
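As a sketch of the neutral wording suggested for the delete error, the helper below includes the node name and pod phase in the message and wraps the cause with `%w`. `wrapDeleteErr` and `errForbidden` are hypothetical names used only for illustration; the real code would wrap the error returned by `hostedClusterClient.Delete`.

```go
package main

import (
	"errors"
	"fmt"
)

// errForbidden is a hypothetical underlying failure for the example.
var errForbidden = errors.New("forbidden")

// wrapDeleteErr (hypothetical helper) describes the deletion generically,
// so the message is accurate for both idle and terminated pods.
func wrapDeleteErr(nodeName, phase string, err error) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("failed to delete upgrade pod for node %s (phase %s): %w", nodeName, phase, err)
}

func main() {
	err := wrapDeleteErr("node-a", "Failed", errForbidden)
	fmt.Println(err)
	// The original cause stays reachable for callers that inspect it.
	fmt.Println(errors.Is(err, errForbidden))
}
```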
|
@coderabbitai review |
✅ Actions performed: Review triggered.
|
99fa86c to fea50de
|
@jparrill, I addressed all the comments and ran the E2E tests manually against a build of the quay.io/rhn_support_psingour/hypershift:OCPBUGS-84308-cpo image. Test results: https://drive.google.com/file/d/1JQ801Zs8x_WrnV-XNbAgU0zd06tvm80J/view?usp=drive_link
Test Scenarios Verified
|
csrwng left a comment
Review of terminated MCD pod handling and helper extraction.
	return nil
}

func deleteUpgradePodIfExists(ctx context.Context, c client.Client, pod *corev1.Pod) error {
deleteUpgradePodIfExists duplicates support/k8sutil.DeleteIfNeeded: same Get/DeletionTimestamp/Delete/IsNotFound pattern. Consider using the existing utility instead of introducing a new helper.
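For readers unfamiliar with the Get/DeletionTimestamp/Delete/IsNotFound pattern named here, this is a minimal in-memory sketch of it. It does not reproduce the signature or behavior of `k8sutil.DeleteIfNeeded`; `fakeStore`, `pod`, `deleteIfNeeded`, and `errNotFound` are hypothetical, and the `terminating` field stands in for a non-nil DeletionTimestamp.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a Kubernetes NotFound API error.
var errNotFound = errors.New("not found")

type pod struct {
	name        string
	terminating bool // stands in for a non-nil DeletionTimestamp
}

// fakeStore is a tiny in-memory substitute for the API client that also
// counts Delete calls, so a test can assert that deletion was skipped.
type fakeStore struct {
	pods        map[string]*pod
	deleteCalls int
}

func (s *fakeStore) get(name string) (*pod, error) {
	p, ok := s.pods[name]
	if !ok {
		return nil, errNotFound
	}
	return p, nil
}

func (s *fakeStore) remove(name string) error {
	s.deleteCalls++
	if _, ok := s.pods[name]; !ok {
		return errNotFound
	}
	delete(s.pods, name)
	return nil
}

// deleteIfNeeded deletes the named pod unless it is already gone or is
// already being deleted. NotFound is treated as success in both steps.
func deleteIfNeeded(s *fakeStore, name string) error {
	p, err := s.get(name)
	if errors.Is(err, errNotFound) {
		return nil // already gone
	}
	if err != nil {
		return fmt.Errorf("get pod %s: %w", name, err)
	}
	if p.terminating {
		return nil // deletion already in progress; skip the Delete call
	}
	if err := s.remove(name); err != nil && !errors.Is(err, errNotFound) {
		return fmt.Errorf("delete pod %s: %w", name, err)
	}
	return nil
}

func main() {
	s := &fakeStore{pods: map[string]*pod{
		"mcd-a": {name: "mcd-a"},
		"mcd-b": {name: "mcd-b", terminating: true},
	}}
	for _, n := range []string{"mcd-a", "mcd-b", "mcd-missing"} {
		fmt.Println(n, deleteIfNeeded(s, n))
	}
	fmt.Println("delete calls:", s.deleteCalls)
}
```

Counting Delete calls in the fake also gives a test a direct way to prove the skip path was taken, rather than inferring it from a later Get.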
…grades
When an in-place MCD upgrade pod terminates (Failed/Succeeded) but the node still needs an upgrade, the controller now deletes the terminated pod so a fresh one can be recreated on the next reconcile loop. A periodic requeue (upgradeRequeueInterval = 30s) ensures the controller re-evaluates nodes that still need upgrades rather than waiting for an external event.
Additionally:
- Extract deleteUpgradePodIfExists helper to reduce duplication across reconcileUpgradePods and deleteUpgradeManifests
- Add test coverage for PodPending phase, multi-node mixed states, NotFound on Delete, RequeueAfter assertion, and Delete failure scenarios
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
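The periodic requeue described in this commit message can be sketched as follows. Only the constant name `upgradeRequeueInterval` and its 30s value come from the commit message; `Result` stands in for `ctrl.Result`, and `nodeState` and `resultFor` are hypothetical names. The real controller may decide when to requeue differently.

```go
package main

import (
	"fmt"
	"time"
)

// upgradeRequeueInterval uses the value stated in the commit message.
const upgradeRequeueInterval = 30 * time.Second

// Result stands in for ctrl.Result, reduced to the field used here.
type Result struct {
	RequeueAfter time.Duration
}

// nodeState is a hypothetical summary of one node's upgrade status.
type nodeState struct {
	name         string
	needsUpgrade bool
}

// resultFor requests a periodic requeue while any node still needs an
// upgrade, and no requeue once every node is up to date.
func resultFor(nodes []nodeState) Result {
	for _, n := range nodes {
		if n.needsUpgrade {
			return Result{RequeueAfter: upgradeRequeueInterval}
		}
	}
	return Result{}
}

func main() {
	mixed := []nodeState{{"node-a", false}, {"node-b", true}}
	done := []nodeState{{"node-a", false}, {"node-b", false}}
	fmt.Println(resultFor(mixed).RequeueAfter, resultFor(done).RequeueAfter)
}
```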
Replace the local deleteUpgradePodIfExists helper with the shared k8sutil.DeleteIfNeeded utility to reduce duplication and improve consistency across the codebase.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fea50de to 9092cca
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jparrill, PoornimaSingour
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/lgtm |
|
Scheduling tests matching the |
|
/verified by me Test Results - https://drive.google.com/drive/u/0/folders/1LqG7eXZuFFzwiQWHTzDbxgCTQLwZPWIL
|
|
@PoornimaSingour: This PR has been marked as verified by |
|
/retest |
|
@PoornimaSingour: all tests passed!
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@PoornimaSingour: Jira Issue Verification Checks: Jira Issue OCPBUGS-84308
Jira Issue OCPBUGS-84308 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 |
|
Fix included in release 5.0.0-0.nightly-2026-06-09-022526 |
|
/jira backport release-4.22 |
|
@PoornimaSingour: Failed to create backported issues: An error was encountered cloning bug for cherrypick for bug OCPBUGS-84308 on the Jira server at https://redhat.atlassian.net. No known errors were detected, please see the full error message for details.
Full error message:
request failed. Please analyze the request body for more details. Status code: 400: {"errorMessages":[],"errors":{"customfield_10980":"Field does not support update 'customfield_10980'","customfield_10978":"Field does not support update 'customfield_10978'","customfield_10979":"Field does not support update 'customfield_10979'"}}
Please contact an administrator to resolve this issue, then request a bug refresh with |
|
/jira backport release-4.22 |
|
@bradmwilliams: The following backport issues have been created:
Queuing cherrypicks to the requested branches to be created after this PR merges: |
|
@openshift-ci-robot: #8434 failed to apply on top of branch "release-4.22": |
What this PR does / why we need it:
When an in-place MachineConfig daemon pod is prematurely terminated (e.g., by a forced node drain), it may transition to Succeeded or Failed phase without having completed the configuration update. Previously, reconcileUpgradePods did not check the pod's phase when it already existed, leaving the terminated pod in place and causing the upgrade to stall indefinitely.
Now, when an MCD pod exists in a terminal phase (Succeeded or Failed) on a node that still requires upgrading, the controller deletes the pod so it is recreated on the next reconciliation cycle.
Which issue(s) this PR fixes:
Fixes: https://redhat.atlassian.net/browse/OCPBUGS-84308